Recap
Interpretation
Interpreting simple scenarios:
The matrix equation fits the mean of \(Y\) as a linear function of many variables:
\[Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki}\]
\(\hat{Y_i}\) is the predicted mean of \(Y\) given values of \(\mathbf{X}_i\)
We obtain the coefficients \(b_0, b_1, \dots, b_k = \beta\) from:
\[\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
This is the matrix formula for least squares regression.
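As a sanity check, the matrix formula reproduces the coefficients from R's `lm()`. A minimal sketch on simulated data (the variable names here are made up, not the course data):

```r
set.seed(1)
# Simulated data: an intercept column of 1s plus two predictors
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
X  <- cbind(1, x1, x2)

# The matrix formula: (X'X)^{-1} X'Y
beta_matrix <- solve(t(X) %*% X) %*% t(X) %*% y

# lm() gives the same coefficients
beta_lm <- coef(lm(y ~ x1 + x2))
all.equal(as.vector(beta_matrix), unname(beta_lm))
```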
If \(X\) is a column vector of \(1\)s, \(\beta\) is just the mean of \(Y\). (We just did this)
If \(X\) is a column of \(1\)s and a column of \(X_1\), \(\beta\) contains an intercept and a slope on \(X_1\).
\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]
\[\mathbf{X}\beta = \begin{pmatrix} 1 & x_{11} & x_{21} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \hat{Y}\]
The mathematical procedures we use in regression ensure that:
\(1\). The mean of the residuals is always zero: \(\overline{e} = 0\). Because we included an intercept (\(a\) or \(b_0\)), the regression line goes through the point of averages, so the residuals average to zero. This also holds for the residuals from the mean.
\(2\). \(Cov(X, e) = 0\). This holds by construction: the first-order conditions of least squares force the residuals to be uncorrelated with every column of \(X\).
\(3\). \(\overline{Y} = \overline{X}'\beta\): the predicted value \(\hat{Y}\) when all variables in \(X\) are at their mean is the mean of \(Y\).
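Properties 1 through 3 can be verified numerically; a sketch on simulated data (hypothetical names, not the course data):

```r
set.seed(2)
# Simulated data and a least squares fit that includes an intercept
n   <- 300
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- 2 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
e   <- resid(fit)

all.equal(mean(e), 0)    # 1. residuals have mean zero
all.equal(cov(x1, e), 0) # 2. Cov(X, e) = 0 for every column of X
all.equal(cov(x2, e), 0)

# 3. the prediction at the means of X equals the mean of Y
xbar_beta <- sum(coef(fit) * c(1, mean(x1), mean(x2)))
all.equal(xbar_beta, mean(y))
```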
\(4\). The slope \(b_k\) captures the change in \(Y\) due to variation in \(X_k\) that is orthogonal (uncorrelated) to all other columns of \(X\).
\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]
\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)
where \(X_k^* = X_k - \hat{X}_k\), with \(\hat{X}_k\) obtained from the regression:
\(X_k = c_0 + c_1 X_1 + \ldots + c_j X_j\) for all \(j \neq k\)
\(X_k^*\) is the residual from regressing \(X_k\) on all other \(X_{j \neq k}\)
How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\), if \(X_k^*\) is the residual of \(X_k\) after regressing it on all other \(X_{j \neq k}\)?
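One way to see it: the multivariate slope on \(X_k\) equals the slope of \(Y\) on the residualized \(X_k^*\). A sketch on simulated data (hypothetical names):

```r
set.seed(3)
# Simulated data where x2 is correlated with x1
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# Coefficient on x2 from the full multivariate regression
b2_full <- unname(coef(lm(y ~ x1 + x2))["x2"])

# Residualize x2 on the other regressor(s), then take Cov/Var
x2_star <- resid(lm(x2 ~ x1))
b2_fwl  <- cov(x2_star, y) / var(x2_star)

all.equal(b2_full, b2_fwl)
```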
\[E(Y_i | X_i) = f(X)\]
Conditional: mean of \(Y\) is different at different levels of \(X\)
Expectation: expected value/mean of \(Y\)
Function: mathematical function links values of \(X\) to mean of \(Y\). Could be, does not need to be, linear
How does multivariate regression relate back to Conditional Expectation Function?
The Least Squares prediction \(\widehat{Y}_i = X_i'\beta\) gives the linear approximation of \(E(Y_i | X_i)\) with the smallest mean-squared error (the minimum distance to the true \(E(Y_i | X_i)\)).
Exercise: Download the data. It contains mean income for each group of people with the same AGE in years.
https://www.dropbox.com/s/gyvv0mgvfic9jv7/acs_grouped.rda?dl=0
Exercise:
Use matrix calculations in R to obtain the linear approximation of \(E(Y_i | X_i)\):
#With individual data...
X = cbind(1, acs_data$AGE)
y = acs_data$INCEARN
beta = solve(t(X) %*% X) %*% t(X) %*% y
beta
##          [,1]
## [1,] 17501.992
## [2,]  3668.629
Do your estimates agree? Why or why not?
We are not accounting for the number of people in each value of \(X_i\)
#With aggregate data...
X = cbind(1, cef$AGE)
y = cef$INCEARN
w = cef$respondents
w = diag(w) # diagonal matrix of group counts (weights)
beta = solve(t(X) %*% w %*% X) %*% t(X) %*% w %*% y
beta
##          [,1]
## [1,] 17501.992
## [2,]  3668.629
We can reproduce any individual-level least squares fit from grouped data by weighting each group by its number of observations. We are directly modelling the conditional expectation \(E(Y_i | X_i)\) as a linear function.
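A sketch of this equivalence on simulated data (hypothetical names): weighted least squares on group means, with group sizes as weights, reproduces the individual-level coefficients.

```r
set.seed(4)
# Simulated individual-level data: several people at each value of x
x <- rep(20:60, times = sample(1:5, 41, replace = TRUE))
y <- 1000 + 50 * x + rnorm(length(x), sd = 10)

# Individual-level least squares
b_ind <- coef(lm(y ~ x))

# Collapse to group means and group sizes
ybar   <- as.vector(tapply(y, x, mean))
counts <- as.vector(tapply(y, x, length))
xs     <- sort(unique(x))

# Weighted least squares on the group means, weights = group sizes
b_grp <- coef(lm(ybar ~ xs, weights = counts))

all.equal(unname(b_ind), unname(b_grp))
```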
Least Squares estimates the mean of \(Y\) as a function of values of \(X\).
What are dummy variables?
Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise.
Can be used for variables with many mutually exclusive categories (e.g. race, marital status, etc.)
How do we make them?
In R, with as.factor() each unique value gets a dummy: lm(y ~ as.factor(any_variable))
## (Intercept) FEMALE
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 1
## 5 1 0
## 6 1 1
## (Intercept) FEMALE
## 201941.70 -57269.02
## (Intercept) FEMALE MALE
## 1 1 0 1
## 2 1 0 1
## 3 1 0 1
## 4 1 1 0
## 5 1 0 1
## 6 1 1 0
## (Intercept) FEMALE MALE
## 201941.70 -57269.02 NA
## FEMALE MALE
## 1 0 1
## 2 0 1
## 3 0 1
## 4 1 0
## 5 0 1
## 6 1 0
## FEMALE MALE
## 144672.7 201941.7
What do they do?
If there is an intercept: one category is dropped as the reference group, and each dummy's coefficient is the difference between that category's mean of \(Y\) and the reference group's mean (R will do this by default).
If there is no intercept: each dummy's coefficient is the mean of \(Y\) in its category.
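A sketch on simulated data (hypothetical names) confirming both readings of the dummy coefficients:

```r
set.seed(5)
# Simulated data: income with a binary female indicator
female <- rbinom(100, 1, 0.5)
income <- 200000 - 57000 * female + rnorm(100, sd = 1000)

# With an intercept: the dummy coefficient is the difference in group means
b_int      <- coef(lm(income ~ female))
diff_means <- mean(income[female == 1]) - mean(income[female == 0])
all.equal(unname(b_int["female"]), diff_means)

# Without an intercept: each dummy coefficient is its group's mean
b_noint <- coef(lm(income ~ 0 + factor(female)))
all.equal(unname(b_noint),
          c(mean(income[female == 0]), mean(income[female == 1])))
```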
How did different regions in the UK vote on Brexit?
brexit data: https://www.dropbox.com/s/b5t5aao5frjpsw6/brexit.rda?dl=0
Use the table function to find out what the different Region names are.
Use the hist function to visually examine the distribution of the leave vote (Pct_Lev).
Using the lm function, regress Pct_Lev on Region (inspect the results with summary(your_model)), then regress Pct_Lev on -1 + Region.
Comparing the your_model$fitted.values from each, what can you say about the \(\widehat{Y}\) produced by these two regressions?
##
## East East Midlands London
## 47 40 33
## North East North West Northern Ireland
## 12 39 1
## Scotland South East South West
## 32 67 37
## Wales West Midlands Yorkshire and The Humber
## 22 30 21
##
## Call:
## lm(formula = Pct_Lev ~ RegionU, data = brexit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.8121 -4.7843 0.4257 5.1457 30.5685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.13687 1.39400 28.075 < 2e-16 ***
## RegionULondon -0.04536 1.95643 -0.023 0.982
## RegionUNorth West 16.77825 1.88088 8.920 < 2e-16 ***
## RegionUYorkshire and The Humber 19.51312 2.21458 8.811 < 2e-16 ***
## RegionUNorth East 20.34229 2.66931 7.621 2.15e-13 ***
## RegionUWest Midlands 21.17779 2.00400 10.568 < 2e-16 ***
## RegionUEast Midlands 20.43763 1.87025 10.928 < 2e-16 ***
## RegionUSouth West 14.54745 1.90365 7.642 1.87e-13 ***
## RegionUEast 17.82525 1.80729 9.863 < 2e-16 ***
## RegionUSouth East 13.03312 1.69451 7.691 1.34e-13 ***
## RegionUWales 14.21085 2.18398 6.507 2.50e-10 ***
## RegionUNorthern Ireland 5.08312 8.00793 0.635 0.526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.886 on 369 degrees of freedom
## Multiple R-squared: 0.443, Adjusted R-squared: 0.4264
## F-statistic: 26.68 on 11 and 369 DF, p-value: < 2.2e-16
How is this different from what you produced?
During the Brexit vote, many areas thought to support "Remain" experienced heavy rainfall. total_precip_prepolling is the number of mm of rain that fell just before the polls opened.
Regress Pct_Lev on total_precip_prepolling + Region.
What is total_precip_prepolling regressed on to compute its residual in this case? (think of \(X_k^*\))
What does the coefficient on total_precip_prepolling mean?
##
## Call:
## lm(formula = Pct_Lev ~ total_precip_prepolling + Region, data = brexit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.7877 -4.7351 0.4749 4.7649 28.7536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.55803 1.49590 35.135 < 2e-16 ***
## total_precip_prepolling 0.05186 0.01165 4.451 1.13e-05 ***
## RegionEast Midlands 7.01178 1.92738 3.638 0.000314 ***
## RegionLondon -19.26625 1.77485 -10.855 < 2e-16 ***
## RegionNorth East 6.92113 2.67736 2.585 0.010120 *
## RegionNorth West 3.35710 1.93773 1.732 0.084024 .
## RegionNorthern Ireland -8.33820 7.83610 -1.064 0.287993
## RegionScotland -13.42789 2.02081 -6.645 1.10e-10 ***
## RegionSouth East -3.11209 1.51142 -2.059 0.040193 *
## RegionSouth West 1.11001 1.95693 0.567 0.570911
## RegionWales 0.78957 2.21970 0.356 0.722260
## RegionWest Midlands 7.75486 2.05162 3.780 0.000183 ***
## RegionYorkshire and The Humber 6.09083 2.24826 2.709 0.007061 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.692 on 368 degrees of freedom
## Multiple R-squared: 0.4715, Adjusted R-squared: 0.4542
## F-statistic: 27.35 on 12 and 368 DF, p-value: < 2.2e-16
\(2\): We regress Percent Leave on the residual pre-election precipitation.
The other variables are dummies for region. Thus, \(\widehat{precipitation_i}\) is the mean precipitation in each region.
So the residual, \(precipitation_i^*\), is the difference between precipitation in a constituency and the mean precipitation in its region.
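A sketch of the same logic on simulated data (hypothetical names, since the brexit file is not loaded here): residualizing a variable on group dummies simply subtracts the group mean.

```r
set.seed(6)
# Simulated stand-in: precipitation whose mean varies by region
region <- rep(c("A", "B", "C"), each = 20)
precip <- rnorm(60, mean = rep(c(30, 50, 70), each = 20), sd = 5)

# Residual from regressing precipitation on region dummies...
precip_star <- resid(lm(precip ~ region))

# ...equals precipitation minus its region mean
region_means <- ave(precip, region)
all.equal(unname(precip_star), precip - region_means)
```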
You can change the “reference group” like this. This code shifts “Scotland” to the reference by making it the first “level” in the factor.
levels = c('Scotland', 'London', 'North West', 'Yorkshire and The Humber', 'North East', 'West Midlands', 'East Midlands', 'South West', 'East', 'South East', 'Wales', 'Northern Ireland')
brexit[, RegionU := factor(Region, levels = levels)]
Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.
Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.
The variables are:
UHRSWORK (the usual hours worked per week)
FEMALE (an indicator for female)
AGE (in years)
LAW (an indicator for being a lawyer)
INCEARN (total annual income earned in dollars)
Let's now find the (approximate) linear conditional expectation function of earnings, across hours worked, age, profession, and gender:
##
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -304486 -90963 -29238 72845 743690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11083.5 9484.6 1.169 0.243
## UHRSWORK 1075.5 114.7 9.373 <2e-16 ***
## AGE 3452.8 125.2 27.578 <2e-16 ***
## LAW -48919.9 2674.6 -18.291 <2e-16 ***
## FEMALE -37223.1 2799.8 -13.295 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 127800 on 9995 degrees of freedom
## Multiple R-squared: 0.146, Adjusted R-squared: 0.1457
## F-statistic: 427.2 on 4 and 9995 DF, p-value: < 2.2e-16
How do we make sense of, e.g., the slope on FEMALE?
The previous slide shows the linear approximation of the CEF of earnings by age.
In your small groups:
Can you think of a way to use least squares to better approximate the true conditional expectation function of earnings by Age?
How would you do it?
Try it out
How would you incorporate this into the example above, ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)?
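One possible answer, sketched on simulated data (hypothetical names, since acs_data is not attached here): add polynomial terms in AGE. This is still least squares, because the model remains linear in the coefficients.

```r
set.seed(7)
# Simulated stand-in for acs_data: earnings are curved in age
AGE     <- sample(25:65, 1000, replace = TRUE)
INCEARN <- 20000 + 4000 * AGE - 40 * AGE^2 + rnorm(1000, sd = 5000)

# Quadratic in age via I(AGE^2): linear in b0, b1, b2
quad <- lm(INCEARN ~ AGE + I(AGE^2))
lin  <- lm(INCEARN ~ AGE)

# The quadratic captures curvature the straight line misses
summary(quad)$r.squared > summary(lin)$r.squared
```

In the full example above, the same idea would read lm(INCEARN ~ UHRSWORK + AGE + I(AGE^2) + LAW + FEMALE, acs_data).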